Skip to content

Conversation

@Jintao-Huang
Copy link
Collaborator

No description provided.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for multimodal models within the Megatron framework. The primary goal is to enable the integration of vision components, specifically for models like Qwen2.5VL, and to ensure that model configurations are correctly handled during the conversion process from HuggingFace formats to Megatron. This lays the groundwork for training and deploying models that can process both text and visual data.

Highlights

  • Multimodal Model Integration: Introduces a new Qwen2_5VL_Vit module to support Vision Transformer (ViT) components for multimodal models within the Megatron framework.
  • Configuration Handling for Multimodal Models: Enhances the model conversion utility (convert.py) to correctly extract and use text_config for multimodal models when converting HuggingFace configurations to Megatron.
  • Improved Argument Passing: Ensures essential model metadata (model_info and model_meta) is passed as extra arguments during Megatron initialization, facilitating proper setup for multimodal architectures.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds initial support for multimodal models within Megatron. The changes introduce a new Vision Transformer (ViT) model wrapper for Qwen2.5-VL and update the model conversion script to correctly handle multimodal configurations. My review focuses on the new ViT wrapper, where I've identified a couple of critical issues: a missing import that will cause a NameError, and an incorrect implementation of the get_input_embeds method that will lead to a runtime error. I've provided specific suggestions to address these problems. The other changes appear logical for the goal of adding multimodal capabilities.

@Jintao-Huang Jintao-Huang changed the title [WIP] [megatron] support multimodal model [megatron] support multimodal model Aug 31, 2025
@Jintao-Huang
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for multimodal models within the Megatron framework, with a specific implementation for Qwen2.5-VL. The changes are extensive, touching upon argument handling, model architecture, weight conversion utilities, and the training process to accommodate visual components. Key additions include a Qwen2_5VL_Vit module for the visual tower, modifications to GPTModel to integrate visual embeddings, and updates to the conversion scripts to handle multimodal weights. My review identifies a critical bug in the initialization of the visual processor and a high-severity issue in how positional embedding types are checked, which could lead to incorrect behavior.

model_dir = args.model_info.model_dir
kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
with patch_device_map_meta(Qwen2_5_VLTextModel):
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The processor is not stored after being loaded from get_model_tokenizer, but it's used later in get_inputs_embeds (e.g., self.processor.image_processor). This will raise an AttributeError at runtime. You should store the returned processor in self.processor.

Suggested change
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
model, self.processor = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

rotary_seq_len,
packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd',
)
elif self.position_embedding_type in 'mrope':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The condition self.position_embedding_type in 'mrope' is likely a bug. It checks if self.position_embedding_type is one of the characters 'm', 'r', 'o', 'p', 'e', not the string 'mrope'. This works by coincidence for now because no other position embedding types share these characters, but it's fragile and incorrect. It should be changed to a direct string comparison for correctness and clarity.

Suggested change
elif self.position_embedding_type in 'mrope':
elif self.position_embedding_type == 'mrope':

@modelscope modelscope deleted a comment from gemini-code-assist bot Sep 1, 2025
@modelscope modelscope deleted a comment from gemini-code-assist bot Sep 1, 2025
@Jintao-Huang Jintao-Huang changed the title [megatron] support multimodal model [megatron] support multimodal model SFT (Full/LoRA) Sep 1, 2025
@Jintao-Huang
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR adds support for multimodal model SFT in Megatron, including new arguments, model classes, and training logic. The changes are extensive and well-structured. However, I've identified a critical issue related to the initialization order of Megatron components. Specifically, patch_megatron_tokenizer is called before Megatron is initialized, which will cause a runtime error. This has a cascading effect, including an uninitialized processor in the new Qwen2_5VL_Vit class. I've also found a minor typo in an error message. Addressing the initialization order is crucial for the new functionality to work.

with torch.device('meta'):
self.model, self.processor = args.get_model_processor(**kwargs)
self._prepare_template()
patch_megatron_tokenizer(self.processor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The function patch_megatron_tokenizer calls megatron.training.get_args(), which requires Megatron to be initialized (i.e., initialize_megatron must have been called). However, patch_megatron_tokenizer is called here before self.trainer.train(), which is where initialize_megatron is eventually called. This will lead to a runtime error because megatron.training.get_args() will fail. The call to patch_megatron_tokenizer should be moved to a point after initialize_megatron has been executed, for example, within the MegatronTrainer after initialization.

Comment on lines 43 to 52
def __init__(self, config):
from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel
super().__init__(config)
args = get_args()
model_dir = args.model_info.model_dir
kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
with patch_device_map_meta(Qwen2_5_VLTextModel):
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
self.model = model.visual.to('cuda')
self.model_config = model.config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The method get_inputs_embeds uses self.processor, but it is not initialized in the Qwen2_5VL_Vit class. This will cause an AttributeError at runtime. The processor should be made available to this class. Assuming the related initialization order issue is fixed and patch_megatron_tokenizer correctly sets args.tokenizer on the global Megatron arguments, you should initialize self.processor in the __init__ method.

Suggested change
def __init__(self, config):
from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel
super().__init__(config)
args = get_args()
model_dir = args.model_info.model_dir
kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
with patch_device_map_meta(Qwen2_5_VLTextModel):
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
self.model = model.visual.to('cuda')
self.model_config = model.config
def __init__(self, config):
from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel
super().__init__(config)
args = get_args()
self.processor = args.tokenizer
model_dir = args.model_info.model_dir
kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
with patch_device_map_meta(Qwen2_5_VLTextModel):
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
self.model = model.visual.to('cuda')
self.model_config = model.config

Comment on lines 180 to 181
raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implmented in '
'MultimodalRotaryEmbedding yet.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the error message. "implmented" should be "implemented".

Suggested change
raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implmented in '
'MultimodalRotaryEmbedding yet.')
raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implemented in '
'MultimodalRotaryEmbedding yet.')

@Jintao-Huang
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to support Supervised Fine-Tuning (SFT) for multimodal models within the Megatron framework, covering both full-parameter and LoRA tuning. Key additions include new arguments for controlling component freezing (LLM, ViT, aligner), a new MultimodalGPTModel class to handle multimodal inputs, and the implementation of Multimodal RoPE (mrope). The changes span across model definitions, data processing pipelines, training arguments, and documentation to provide comprehensive support for this new feature. The code is well-structured, and the changes are consistent. I have one suggestion to improve device handling for better portability.

kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
with patch_device_map_meta(Qwen2_5_VLTextModel):
model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
self.model = model.visual.to('cuda')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The device is hardcoded to 'cuda'. This could cause issues if running in an environment without CUDA or on a specific CUDA device in a multi-GPU setup. It's better to use a more flexible way to specify the device, for example by using torch.cuda.current_device() which respects the device set for the current process by Megatron's initialization.

Suggested change
self.model = model.visual.to('cuda')
self.model = model.visual.to(torch.cuda.current_device())

@Jintao-Huang
Copy link
Collaborator Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality by adding support for Supervised Fine-Tuning (SFT) of multimodal models using Megatron, for both full-parameter and LoRA tuning. The changes are comprehensive, touching upon model definitions, training arguments, data processing, and documentation. Key additions include the MultimodalGPTModel class, which elegantly wraps the existing GPTModel to incorporate visual components, and new arguments like freeze_llm, freeze_vit, and freeze_aligner to control training. The implementation for mRoPE support and updates to context parallelism for multimodal data are also notable. The documentation has been updated with new tutorials and parameter descriptions, which is very helpful. My review found a couple of minor inconsistencies in the new documentation examples, which I've commented on. Overall, this is a well-executed and valuable addition to the framework.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the inference script example, the model path appears to be incorrect. Based on the context, the model should be Qwen2.5-VL-7B-Instruct, but it's written as Qwen2.5-7B-Instruct, missing the VL part. This could be confusing for users.

Suggested change
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There seems to be a typo in the model path in the inference script example. Based on the context, the model should be Qwen2.5-VL-7B-Instruct, but it's written as Qwen2.5-7B-Instruct here, missing the VL part. This could be confusing for users.

Suggested change
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality by adding support for Supervised Fine-Tuning (SFT) of multimodal models using Megatron, for both full-parameter and LoRA training. The changes are extensive, touching upon model definitions, argument parsing, data collation, training loops, and weight conversion utilities to accommodate multimodal inputs. New documentation and example scripts are provided to guide users. The implementation includes patches for m-RoPE and updates to handle visual components like ViT and aligners. The code appears well-structured and the changes are consistent across the codebase. I've identified a couple of minor typos in the new documentation files that should be corrected for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There appears to be a typo in the model path. The model being trained is Qwen2.5-VL-7B-Instruct, so the path here should likely be megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf to match the output directory from the export step.

Suggested change
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There appears to be a typo in the model path. The model being trained is Qwen2.5-VL-7B-Instruct, so the path here should likely be megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf to match the output directory from the export step.

Suggested change
--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

@Jintao-Huang Jintao-Huang changed the title [megatron] support multimodal model SFT (Full/LoRA) [megatron] support multimodal model SFT/DPO (Full/LoRA) Sep 2, 2025
@Jintao-Huang Jintao-Huang changed the title [megatron] support multimodal model SFT/DPO (Full/LoRA) [megatron] support multimodal model CPT/SFT/DPO (Full/LoRA) Sep 2, 2025
@Jintao-Huang Jintao-Huang merged commit a161d54 into modelscope:main Sep 3, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants